NSF PAR Search | NSF Public Access Repository

The Geometry of Concepts: Sparse Autoencoder Feature Structure

https://doi.org/10.3390/e27040344

Li, Yuxiao; Michaud, Eric J; Baek, David D; Engels, Joshua; Sun, Xiaoqing; Tegmark, Max (March 2025, Entropy)

Sparse autoencoders have recently produced dictionaries of high-dimensional vectors corresponding to the universe of concepts represented by large language models. We find that this concept universe has interesting structure at three levels: (1) The “atomic” small-scale structure contains “crystals” whose faces are parallelograms or trapezoids, generalizing well-known examples such as (man:woman::king:queen). We find that the quality of such parallelograms and associated function vectors improves greatly when projecting out global distractor directions such as word length, which is efficiently performed with linear discriminant analysis. (2) The “brain” intermediate-scale structure has significant spatial modularity; for example, math and code features form a “lobe” akin to functional lobes seen in neural fMRI images. We quantify the spatial locality of these lobes with multiple metrics and find that clusters of co-occurring features, at coarse enough scale, also cluster together spatially far more than one would expect if feature geometry were random. (3) The “galaxy”-scale large-scale structure of the feature point cloud is not isotropic, but instead has a power law of eigenvalues with steepest slope in middle layers. We also quantify how the clustering entropy depends on the layer.
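The effect described in point (1) can be sketched on toy data. The example below is a minimal illustration, not the authors' method. It uses hand-built 3-D vectors (the names `gender`, `royalty`, and `distractor` are hypothetical) and a plain orthogonal projection along a known distractor direction in place of the linear discriminant analysis the abstract mentions. It shows only that two difference vectors which should be parallel look misaligned until the distractor component is removed.

```python
import numpy as np

def project_out(vectors, direction):
    """Remove the component of each row of `vectors` along `direction`."""
    unit = direction / np.linalg.norm(direction)
    return vectors - np.outer(vectors @ unit, unit)

def cosine(a, b):
    """Cosine similarity between two 1-D vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical concept axes (assumptions for illustration only).
gender = np.array([1.0, 0.0, 0.0])
royalty = np.array([0.0, 1.0, 0.0])
distractor = np.array([0.0, 0.0, 1.0])  # stands in for e.g. word length

# Toy "feature vectors": each word carries a different amount of distractor.
man = 1.0 * distractor
woman = gender + 3.0 * distractor
king = royalty + 2.0 * distractor
queen = gender + royalty + 0.5 * distractor

# In a perfect parallelogram, woman - man equals queen - king.
diffs = np.stack([woman - man, queen - king])
before = cosine(diffs[0], diffs[1])

# Project the distractor direction out of both difference vectors.
cleaned = project_out(diffs, distractor)
after = cosine(cleaned[0], cleaned[1])

print(f"cosine before projection: {before:.3f}")
print(f"cosine after projection:  {after:.3f}")
```

In this toy setup the distractor direction is known in advance. In practice it would have to be estimated from data, which is the role the abstract assigns to linear discriminant analysis.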
Free, publicly-accessible full text available March 27, 2026
Not All Language Model Features Are Linear

Engels, Joshua; Liao, Isaac; Michaud, Eric J; Gurnee, Wes; Tegmark, Max (January 2025, Open Review)

Free, publicly-accessible full text available January 22, 2026
Sparse Feature Circuits: Discovering and Editing Interpretable Causal Graphs in Language Models

Marks, Samuel; Rager, Can; Michaud, Eric J; Belinkov, Yonatan; Bau, David; Mueller, Aaron (January 2025, Open Review)

Free, publicly-accessible full text available January 22, 2026
Opening the AI Black Box: Distilling Machine-Learned Algorithms into Code

https://doi.org/10.3390/e26121046

Michaud, Eric J; Liao, Isaac; Lad, Vedang; Liu, Ziming; Mudide, Anish; Loughridge, Chloe; Guo, Zifan Carl; Kheirkhah, Tara Rezaei; Vukelić, Mateja; Tegmark, Max (December 2024, Entropy)

Can we turn AI black boxes into code? Although this mission sounds extremely challenging, we show that it is not entirely impossible by presenting a proof-of-concept method, MIPS, that can synthesize programs based on the automated mechanistic interpretability of neural networks trained to perform the desired task, auto-distilling the learned algorithm into Python code. We test MIPS on a benchmark of 62 algorithmic tasks that can be learned by an RNN and find it highly complementary to GPT-4: MIPS solves 32 of them, including 13 that are not solved by GPT-4 (which also solves 30). MIPS uses an integer autoencoder to convert the RNN into a finite state machine, then applies Boolean or integer symbolic regression to capture the learned algorithm. As opposed to large language models, this program synthesis technique makes no use of (and is therefore not limited by) human training data such as algorithms and code from GitHub. We discuss opportunities and challenges for scaling up this approach to make machine-learned models more interpretable and trustworthy.
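The idea of turning a recurrent network into a finite state machine can be sketched on a toy example. This is not MIPS itself. The function `toy_rnn_step` is a hypothetical, hand-written stand-in for a trained RNN cell, and simple rounding of the hidden state stands in for the integer autoencoder. The sketch only shows how discretized hidden states yield a transition table that a later symbolic step could compress into a formula.

```python
import itertools

def toy_rnn_step(h, x, t):
    """Hypothetical stand-in for a trained RNN cell tracking running parity.

    The small time-dependent drift mimics the fact that real hidden states
    are only approximately integer-valued.
    """
    drift = 0.05 * ((t % 3) - 1)
    return abs(h - x) + drift

def extract_fsm(step, inputs=(0, 1), seq_len=6):
    """Round hidden states to integers and record (state, input) -> next state."""
    table = {}
    for seq in itertools.product(inputs, repeat=seq_len):
        h = 0.0
        for t, x in enumerate(seq):
            state = round(h)
            h = step(h, x, t)
            nxt = round(h)
            # A consistent FSM maps each (state, input) pair to one next state.
            if table.setdefault((state, x), nxt) != nxt:
                raise ValueError(f"inconsistent transition at {(state, x)}")
    return table

def run_fsm(table, seq, start=0):
    """Replay an input sequence through the extracted transition table."""
    state = start
    for x in seq:
        state = table[(state, x)]
    return state

table = extract_fsm(toy_rnn_step)
print(table)
print(run_fsm(table, [1, 1, 1]))
```

For this toy cell the extracted table coincides with `state XOR input`, which is the kind of compact rule a Boolean symbolic regression step would aim to recover from the table.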
Full Text Available
Efficient Dictionary Learning with Switch Sparse Autoencoders

Mudide, Anish; Engels, Joshua; Michaud, Eric J; Tegmark, Max; Schroeder de Witt, Christian (January 2024, Open Review)

Full Text Available